Write documentation for humans. Write data for computers.
- data/raw/
- data/clean/
- processing/
- analysis/
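This layout can be set up once at the start of a project. A minimal sketch in R (directory names as above):

```r
# Create the project skeleton; by convention, files in data/raw/
# are never edited by hand after collection
dirs <- c("data/raw", "data/clean", "processing", "analysis")
for (d in dirs) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}
```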
Preferred: consistent, descriptive names (e.g., work_phone vs. home_phone).

Messy data is usually messy in the following ways:
Hadley Wickham’s (2014) Tidy Data paper.
Do read it for details and more formal definitions.
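As a toy illustration of the paper's core idea (each variable is a column, each observation is a row); the data here are made up:

```r
library(tidyr)

# Made-up messy table: one column per day, so a variable (day)
# is spread across the column headers
messy <- data.frame(name = c("Ana", "Ben"),
                    day1 = c("talk", "poster"),
                    day2 = c("hack", "talk"))

# Tidy: one row per (person, day) observation
tidy <- gather(messy, key = "day", value = "event", -name)
```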
(R heavy, but many tools do the same)

```r
knitr::kable(x[106, 1:4], row.names = FALSE)
```

| Timestamp | Name | Email address | Which events do you plan to attend on Sunday, July 30? |
|---|---|---|---|
| 7/17/2017 10:45:31 | Fred Satterfield | px@kiwabmx.bzq | [Saturday, July 29th, 7p] Dinner/drinks party at Brian Nosek’s home (bus transportation to/from Omni hotel all evening), Workshop: Fundamentals of meta-analysis (9:30a-12:30p), Workshop: Using OSF (1:30-3:30p), Workshop: Writing papers to be transparent, reproducible, and fabulous (3:30-5:30p), Hack: Resources for changing culture in department, university, society (all day) |
| Which events do you plan to attend on Monday, July 31? (not including lightning talks, unconference sessions, and hackathons that will continue or start on 2nd day) | Which events do you plan to attend on Tuesday, August 1st? |
|---|---|
| Workshop: R Markdown (1:30-3:30p), Workshop: Power analysis and effect size (3:30p-5:30p), All Conference Social Event (Dinner+drinks; 7:30p-??) | Workshop: Fundamentals of R (9:30a-12:30p) |
```r
# Read the file in with better headers
library(readr)
x <- read_csv('data/raw/plans_raw.csv',
              col_names = c("time_submitted", "name", "email",
                            "day1", "day2", "day3"),
              col_types = list(
                col_datetime(format = '%m/%d/%Y %H:%M:%S'),
                col_character(), col_character(),
                col_character(), col_character(),
                col_character()),
              skip = 1) # skip 1 row (headers)
head(x)
```

```
## # A tibble: 6 x 6
## time_submitted name email
## <dttm> <chr> <chr>
## 1 2017-07-14 11:54:46 Bernadine Ratke ad@xosz.tdi
## 2 2017-07-14 10:13:10 Juliane Kling jbacqmnoug@qcsrv.hut
## 3 2017-07-14 10:14:02 Argelia DuBuque sbinxjw@bprhc.ybo
## 4 2017-07-14 10:16:39 Williams Mueller xrugcmajh@qbaetkvwmj.kwe
## 5 2017-07-14 11:17:02 Davida Ledner iokv@vl.dyt
## 6 2017-07-16 03:07:14 Ethan Klocko kl@jbn.vmb
## # ... with 3 more variables: day1 <chr>, day2 <chr>, day3 <chr>
```
```r
knitr::kable(x[106, 4:6], row.names = FALSE)
```

| day1 | day2 | day3 |
|---|---|---|
| [Saturday, July 29th, 7p] Dinner/drinks party at Brian Nosek’s home (bus transportation to/from Omni hotel all evening), Workshop: Fundamentals of meta-analysis (9:30a-12:30p), Workshop: Using OSF (1:30-3:30p), Workshop: Writing papers to be transparent, reproducible, and fabulous (3:30-5:30p), Hack: Resources for changing culture in department, university, society (all day) | Workshop: R Markdown (1:30-3:30p), Workshop: Power analysis and effect size (3:30p-5:30p), All Conference Social Event (Dinner+drinks; 7:30p-??) | Workshop: Fundamentals of R (9:30a-12:30p) |
(tidy data illustration)
Tidy solution: divide the data units into tables you can re-merge
```r
library(dplyr)

people <- select(x, time_submitted, name, email) %>% unique()
people$data_origin <- 'plans_raw.csv' # if many possible sources

# If non-existent, create an ID to join the tables by.
# (Usually, you might have an ID for each participant at
# this point, which you'd now deidentify.)
people$participant_ID <- sample(1:nrow(people), nrow(people))

# separate out the dataset, enable later connection with index
x <- merge(x, people) %>% select(-time_submitted, -name, -email)
```

As a bonus, sensitive info is relegated to a single spot and is easier to deal with:
```r
# we don't want to share the name info, but we can extract
# non-identifying metadata of interest
people <- transmute(people,
                    time_submitted, participant_ID,
                    email_server = gsub('.+@([^@]+)$', '\\1', email),
                    tld = gsub('.+\\.(\\w{1,7}$)', '\\1', email),
                    # lengths() counts per row; strsplit(name, ' ')[[1]]
                    # would use only the first name for every row
                    number_names = lengths(strsplit(name, ' ')))
```

| time_submitted | participant_ID | email_server | tld | number_names |
|---|---|---|---|---|
| 2017-07-14 11:54:46 | 8 | xosz.tdi | tdi | 2 |
…August 1 to day1, that is still plan-level info:

| day1 | day2 | day3 | data_origin | participant_ID |
|---|---|---|---|---|
| [Saturday, July 29th, 7p] Dinner/drinks party at Brian Nosek’s home (bus transportation to/from Omni hotel all evening), Hack: Syllabus for Research Methods (all day) | Workshop: R Markdown (1:30-3:30p), Re-hack: Diversity and inclusion (morning), All Conference Social Event (Dinner+drinks; 7:30p-??) | NA | plans_raw.csv | 77 |
Tidy solution: translate the data from wide to long
```r
x <- tidyr::gather(x, event_day, event, -participant_ID)
x$event_day <- as.numeric(gsub('day', '', x$event_day)) # clean up
```

```
## Warning: NAs introduced by coercion
```
| participant_ID | event_day | event |
|---|---|---|
| 77 | 1 | [Saturday, July 29th, 7p] Dinner/drinks party at Brian Nosek’s home (bus transportation to/from Omni hotel all evening), Hack: Syllabus for Research Methods (all day) |
| 53 | 1 | Workshop: Fundamentals of meta-analysis (9:30a-12:30p), Workshop: Using OSF (1:30-3:30p), Workshop: Writing papers to be transparent, reproducible, and fabulous (3:30-5:30p) |
Note the NA values. Why?

Tidy solution: one row per one planned event
(assuming we fixed the comma issue)
```r
library(tidyr)
x <- separate_rows(x, event, sep = ', ')
```

We can now easily fix the mis-dating of Brian’s social, which, as an artifact of survey design, would otherwise still be considered a Day 1 event:
```r
# With just one mis-dated event, we can just re-date it
x$event_day[x$event == paste0("[Saturday, July 29th, 7p] ",
    "Dinner/drinks party at Brian Nosek's home (bus ",
    "transportation to/from Omni hotel all evening)")] <- 0
```

| participant_ID | event_day | event |
|---|---|---|
| 50 | 1 | Workshop: Fundamentals of meta-analysis (9:30a-12:30p) |
Tidy solution: extract the structured parts of `event`: event time, event name, and event type

```r
# Structure of `event` is stable (Type: name (time)), so
# a regular expression can extract it
#
# (Not shown: slight data manipulation to make this work for all events)
x <- extract(x, event,
             c('event_type', 'event_name', 'event_time'),
             regex = "(^[[:alnum:]-]+): (.+) \\((.+)\\)$",
             remove = FALSE)
```

What would an NA event mean?

| participant_ID | event_day | event | event_type | event_name | event_time |
|---|---|---|---|---|---|
| 50 | 1 | Workshop: Fundamentals of meta-analysis (9:30a-12:30p) | Workshop | Fundamentals of meta-analysis | 9:30a-12:30p |
```r
flexible_responses <- c("Everything else will be scheduled during the conference depending on what has legs and group interest",
                        "I am flexible; if there is a hackathon that needs help, let me know.")

# Removal:
# x <- filter(x, !(event %in% flexible_responses))

# Re-definition:
x$event_type[x$event %in% flexible_responses] <- "Flexible"
```

Some possible further improvements:

- The people table would probably run into duplicates.
- Convert event_day to an actual event_date.
- Split event_time into event_starttime and event_duration.
- Drop event, as we’ve extracted everything out of it.
- Move events into their own table (events.csv) and join by event_id.

```r
x <- select(x, -event)
```
```r
write_csv(x, 'data/clean/event_plans.csv')
write_csv(people, 'data/clean/people.csv')
```

| participant_ID | event_day | event_type | event_name | event_time |
|---|---|---|---|---|
| 50 | 1 | Workshop | Fundamentals of meta-analysis | 9:30a-12:30p |
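The event_time split suggested above could start like this. A sketch on made-up values; computing event_duration would additionally need real time parsing (e.g., with lubridate):

```r
library(tidyr)

# Hypothetical sample of the event_time column
times <- data.frame(event_time = c("9:30a-12:30p", "1:30-3:30p"))

# First step toward event_starttime / event_duration:
# split on the dash; durations would then be computed from the parts
times <- separate(times, event_time,
                  into = c("event_starttime", "event_endtime"),
                  sep = "-")
```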
…but if you’re looking for R, tidyverse is excellent
A. Filename: people.csv. Short description: Properties of SIPS attendees.
B. Filename: event_plans.csv. Short description: … people.csv provides participant data for event_plans.csv (one-to-many). Created by: tidying_sips_plans.Rmd.

Variables in event_plans.csv:
A. Name: participant_ID. Joins event_plans.csv to people.csv.
B. Name: event_type. Values: … NA (and meanings): …
C. Name: …